nyc.jpg

An Analysis of Airbnb Data from New York City

Megha Guggari, Rohit Mandavia, Ngan Nguyen

Introduction

Airbnb is a popular tool that has made travel easy with simple/straight-forward room bookings all over the world (fun fact there are over 6 million Airbnb listings worldwide!) However, it is always a headache when trying to figure out the best place to book an Airbnb because you have to factor in things such as price, ratings, availability, and area. Our group chose to analyze Airbnb data from NYC after we realized that we were all travelling to NYC after our exams! We thought about how time-consuming it was to find the perfect Airbnb - one with great reviews, a great price, and one that was actually available! In general, we figured that travelling to NYC is pretty common, and we thought it would be useful to visualize things such as prices, ratings, and availabilities per neighborhood to make room bookings easier.

Airbnb releases open data for different cities. The data set we used can be found here: NYC Open Airbnb Data

In this tutorial, you will be able to see visualizations such as how price relates to neighborhood, how availabilities relate to areas in the city, and how ratings relate to price/neighborhoods, to name a few. This information will hopefully make it easier to make informed decisions about the best place to book an Airbnb!

Outline of project:

  1. Data Collection
  2. Data Preprocessing
  3. Data Visualization
  4. Classification/Prediction
  5. Conclusion

Required Libraries/Tools

For this project, we used the following packages:

  1. Matplotlib
  2. Pandas
  3. Folium
  4. Numpy
  5. Sklearn Linear Regression
  6. Math
  7. Seaborn
In [2]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import pandas
import folium
from folium import plugins
from folium.plugins import HeatMap
import numpy as np
from sklearn import linear_model
import math
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

Part 1 - Data Collection

For our data collection, we chose a dataset from Kaggle that contained open data from NYC Airbnb.

Data explanation:

  • id: id of the Airbnb
  • name: description of the Airbnb
  • host_id: id of the host
  • host_name: Name of the host
  • neighbourhood_group: Neighbourhoods were grouped into 5 groups including:
    • Brooklyn
    • Manhattan
    • Bronx
    • Staten Island
    • Queens
  • neighbourhood: Specific neighbourhood name
  • latitude and longitude
  • room_type: type of Airbnb rental including:
    • Private Room
    • Entire Home/Apartment
    • Shared Room
  • price: Price of the Airbnb for one night
  • minimum_nights: minimum number of nights that Airbnb was available
  • number_of_reviews: number of reviews for the Airbnb
  • last_review: date of last review
  • reviews_per_month: Number of reviews per month (a ratio)
  • calculated_host_listings_count: amount of listings per host
  • availability_365: number of days available out of the year (out of 365 days)
In [3]:
data = pandas.read_csv("AB_NYC_2019.csv")
data
Out[3]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0
5 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 2019-06-22 0.59 1 129
6 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 45 49 2017-10-05 0.40 1 0
7 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Hell's Kitchen 40.76489 -73.98493 Private room 79 2 430 2019-06-24 3.47 1 220
8 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80178 -73.96723 Private room 79 2 118 2017-07-21 0.99 1 0
9 5238 Cute & Cozy Lower East Side 1 bdrm 7549 Ben Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 1 160 2019-06-09 1.33 4 188
10 5295 Beautiful 1br on Upper West Side 7702 Lena Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 5 53 2019-06-22 0.43 1 6
11 5441 Central Manhattan/near Broadway 7989 Kate Manhattan Hell's Kitchen 40.76076 -73.98867 Private room 85 2 188 2019-06-23 1.50 1 39
12 5803 Lovely Room 1, Garden, Best Area, Legal rental 9744 Laurie Brooklyn South Slope 40.66829 -73.98779 Private room 89 4 167 2019-06-24 1.34 3 314
13 6021 Wonderful Guest Bedroom in Manhattan for SINGLES 11528 Claudio Manhattan Upper West Side 40.79826 -73.96113 Private room 85 2 113 2019-07-05 0.91 1 333
14 6090 West Village Nest - Superhost 11975 Alina Manhattan West Village 40.73530 -74.00525 Entire home/apt 120 90 27 2018-10-31 0.22 1 0
15 6848 Only 2 stops to Manhattan studio 15991 Allen & Irina Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 140 2 148 2019-06-29 1.20 1 46
16 7097 Perfect for Your Parents + Garden 17571 Jane Brooklyn Fort Greene 40.69169 -73.97185 Entire home/apt 215 2 198 2019-06-28 1.72 1 321
17 7322 Chelsea Perfect 18946 Doti Manhattan Chelsea 40.74192 -73.99501 Private room 140 1 260 2019-07-01 2.12 1 12
18 7726 Hip Historic Brownstone Apartment with Backyard 20950 Adam And Charity Brooklyn Crown Heights 40.67592 -73.94694 Entire home/apt 99 3 53 2019-06-22 4.44 1 21
19 7750 Huge 2 BR Upper East Cental Park 17985 Sing Manhattan East Harlem 40.79685 -73.94872 Entire home/apt 190 7 0 NaN NaN 2 249
20 7801 Sweet and Spacious Brooklyn Loft 21207 Chaya Brooklyn Williamsburg 40.71842 -73.95718 Entire home/apt 299 3 9 2011-12-28 0.07 1 0
21 8024 CBG CtyBGd HelpsHaiti rm#1:1-4 22486 Lisel Brooklyn Park Slope 40.68069 -73.97706 Private room 130 2 130 2019-07-01 1.09 6 347
22 8025 CBG Helps Haiti Room#2.5 22486 Lisel Brooklyn Park Slope 40.67989 -73.97798 Private room 80 1 39 2019-01-01 0.37 6 364
23 8110 CBG Helps Haiti Rm #2 22486 Lisel Brooklyn Park Slope 40.68001 -73.97865 Private room 110 2 71 2019-07-02 0.61 6 304
24 8490 MAISON DES SIRENES1,bohemian apartment 25183 Nathalie Brooklyn Bedford-Stuyvesant 40.68371 -73.94028 Entire home/apt 120 2 88 2019-06-19 0.73 2 233
25 8505 Sunny Bedroom Across Prospect Park 25326 Gregory Brooklyn Windsor Terrace 40.65599 -73.97519 Private room 60 1 19 2019-06-23 1.37 2 85
26 8700 Magnifique Suite au N de Manhattan - vue Cloitres 26394 Claude & Sophie Manhattan Inwood 40.86754 -73.92639 Private room 80 4 0 NaN NaN 1 0
27 9357 Midtown Pied-a-terre 30193 Tommi Manhattan Hell's Kitchen 40.76715 -73.98533 Entire home/apt 150 10 58 2017-08-13 0.49 1 75
28 9518 SPACIOUS, LOVELY FURNISHED MANHATTAN BEDROOM 31374 Shon Manhattan Inwood 40.86482 -73.92106 Private room 44 3 108 2019-06-15 1.11 3 311
29 9657 Modern 1 BR / NYC / EAST VILLAGE 21904 Dana Manhattan East Village 40.72920 -73.98542 Entire home/apt 180 14 29 2019-04-19 0.24 1 67
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48865 36472171 1 bedroom in sunlit apartment 99144947 Brenda Manhattan Inwood 40.86845 -73.92449 Private room 80 1 0 NaN NaN 1 79
48866 36472710 CozyHideAway Suite 274225617 Alberth Queens Briarwood 40.70786 -73.81448 Entire home/apt 58 1 0 NaN NaN 1 159
48867 36473044 The place you were dreaming for.(only for guys) 261338177 Diana Brooklyn Gravesend 40.59080 -73.97116 Shared room 25 1 0 NaN NaN 6 338
48868 36473253 Heaven for you(only for guy) 261338177 Diana Brooklyn Gravesend 40.59118 -73.97119 Shared room 25 7 0 NaN NaN 6 365
48869 36474023 Cozy, Sunny Brooklyn Escape 1550580 Julia Brooklyn Bedford-Stuyvesant 40.68759 -73.95705 Private room 45 4 0 NaN NaN 1 7
48870 36474911 Cozy, clean Williamsburg 1- bedroom apartment 1273444 Tanja Brooklyn Williamsburg 40.71197 -73.94946 Entire home/apt 99 4 0 NaN NaN 1 22
48871 36475746 A LARGE ROOM - 1 MONTH MINIMUM - WASHER&DRYER 144008701 Ozzy Ciao Manhattan Harlem 40.82233 -73.94687 Private room 35 29 0 NaN NaN 2 31
48872 36476675 Nycity-MyHome 8636072 Ben Manhattan Hell's Kitchen 40.76236 -73.99255 Entire home/apt 260 3 0 NaN NaN 1 9
48873 36477307 Brooklyn paradise 241945355 Clement & Rose Brooklyn Flatlands 40.63116 -73.92616 Entire home/apt 170 1 0 NaN NaN 2 363
48874 36477588 Short Term Rental in East Harlem 214535893 Jeffrey Manhattan East Harlem 40.79760 -73.93947 Private room 50 7 0 NaN NaN 1 22
48875 36478343 Welcome all as family 274273284 Anastasia Manhattan East Harlem 40.78749 -73.94749 Private room 140 1 0 NaN NaN 1 180
48876 36478357 Cozy, Air-Conditioned Private Bedroom in Harlem 177932088 Joseph Manhattan Harlem 40.80953 -73.95410 Private room 60 1 0 NaN NaN 1 26
48877 36479230 Studio sized room with beautiful light 65767720 Melanie Brooklyn Bushwick 40.70418 -73.91471 Private room 42 7 0 NaN NaN 1 16
48878 36479723 Room for rest 41326856 Jeerathinan Queens Elmhurst 40.74477 -73.87727 Private room 45 1 0 NaN NaN 5 172
48879 36480292 Gorgeous 1.5 Bdr with a private yard- Williams... 540335 Lee Brooklyn Williamsburg 40.71728 -73.94394 Entire home/apt 120 20 0 NaN NaN 1 22
48880 36481315 The Raccoon Artist Studio in Williamsburg New ... 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 120 1 0 NaN NaN 3 365
48881 36481615 Peaceful space in Greenpoint, BK 274298453 Adrien Brooklyn Greenpoint 40.72585 -73.94001 Private room 54 6 0 NaN NaN 1 15
48882 36482231 Bushwick _ Myrtle-Wyckoff 66058896 Luisa Brooklyn Bushwick 40.69652 -73.91079 Private room 40 20 0 NaN NaN 1 31
48883 36482416 Sunny Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79755 -73.93614 Private room 75 2 0 NaN NaN 2 364
48884 36482783 Brooklyn Oasis in the heart of Williamsburg 274307600 Jonathan Brooklyn Williamsburg 40.71790 -73.96238 Private room 190 7 0 NaN NaN 1 341
48885 36482809 Stunning Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79633 -73.93605 Private room 75 2 0 NaN NaN 2 353
48886 36483010 Comfy 1 Bedroom in Midtown East 274311461 Scott Manhattan Midtown 40.75561 -73.96723 Entire home/apt 200 6 0 NaN NaN 1 176
48887 36483152 Garden Jewel Apartment in Williamsburg New York 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 170 1 0 NaN NaN 3 365
48888 36484087 Spacious Room w/ Private Rooftop, Central loca... 274321313 Kat Manhattan Hell's Kitchen 40.76392 -73.99183 Private room 125 4 0 NaN NaN 1 31
48889 36484363 QUIT PRIVATE HOUSE 107716952 Michael Queens Jamaica 40.69137 -73.80844 Private room 65 1 0 NaN NaN 2 163
48890 36484665 Charming one bedroom - newly renovated rowhouse 8232441 Sabrina Brooklyn Bedford-Stuyvesant 40.67853 -73.94995 Private room 70 2 0 NaN NaN 2 9
48891 36485057 Affordable room in Bushwick/East Williamsburg 6570630 Marisol Brooklyn Bushwick 40.70184 -73.93317 Private room 40 4 0 NaN NaN 2 36
48892 36485431 Sunny Studio at Historical Neighborhood 23492952 Ilgar & Aysel Manhattan Harlem 40.81475 -73.94867 Entire home/apt 115 10 0 NaN NaN 1 27
48893 36485609 43rd St. Time Square-cozy single bed 30985759 Taz Manhattan Hell's Kitchen 40.75751 -73.99112 Shared room 55 1 0 NaN NaN 6 2
48894 36487245 Trendy duplex in the very heart of Hell's Kitchen 68119814 Christophe Manhattan Hell's Kitchen 40.76404 -73.98933 Private room 90 7 0 NaN NaN 1 23

48895 rows × 16 columns

Part 2 - Preprocessing

We chose to exclude some columns that were not very informative (ones that we did not think we needed in our analysis). The columns we chose to drop were as follows:

  1. id
  2. name
  3. host_id
  4. last_review
  5. calculated_host_listings_count

As our second preprocessing step, we chose to eliminate rows based on if the prices seemed unreasoable. This included if the prices were listed anywhere between 0-25 (a price that low seemed less common for NYC, especially), or if the prices were listed to be above 250 per night (as college students we wanted to keep prices that were more common).

In [4]:
data = data.drop(columns=['id', 'name', 'host_id', 'host_name', 'last_review', 'calculated_host_listings_count'])

data = data[data['price'] >= 25]
data = data[data['price'] <= 250]

data = data.dropna()
data
Out[4]:
neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews reviews_per_month availability_365
0 Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 0.21 365
1 Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 0.38 355
3 Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 4.64 194
4 Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 0.10 0
5 Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 0.59 129
6 Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 45 49 0.40 0
7 Manhattan Hell's Kitchen 40.76489 -73.98493 Private room 79 2 430 3.47 220
8 Manhattan Upper West Side 40.80178 -73.96723 Private room 79 2 118 0.99 0
9 Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 1 160 1.33 188
10 Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 5 53 0.43 6
11 Manhattan Hell's Kitchen 40.76076 -73.98867 Private room 85 2 188 1.50 39
12 Brooklyn South Slope 40.66829 -73.98779 Private room 89 4 167 1.34 314
13 Manhattan Upper West Side 40.79826 -73.96113 Private room 85 2 113 0.91 333
14 Manhattan West Village 40.73530 -74.00525 Entire home/apt 120 90 27 0.22 0
15 Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 140 2 148 1.20 46
16 Brooklyn Fort Greene 40.69169 -73.97185 Entire home/apt 215 2 198 1.72 321
17 Manhattan Chelsea 40.74192 -73.99501 Private room 140 1 260 2.12 12
18 Brooklyn Crown Heights 40.67592 -73.94694 Entire home/apt 99 3 53 4.44 21
21 Brooklyn Park Slope 40.68069 -73.97706 Private room 130 2 130 1.09 347
22 Brooklyn Park Slope 40.67989 -73.97798 Private room 80 1 39 0.37 364
23 Brooklyn Park Slope 40.68001 -73.97865 Private room 110 2 71 0.61 304
24 Brooklyn Bedford-Stuyvesant 40.68371 -73.94028 Entire home/apt 120 2 88 0.73 233
25 Brooklyn Windsor Terrace 40.65599 -73.97519 Private room 60 1 19 1.37 85
27 Manhattan Hell's Kitchen 40.76715 -73.98533 Entire home/apt 150 10 58 0.49 75
28 Manhattan Inwood 40.86482 -73.92106 Private room 44 3 108 1.11 311
29 Manhattan East Village 40.72920 -73.98542 Entire home/apt 180 14 29 0.24 67
30 Manhattan Harlem 40.82245 -73.95104 Private room 50 3 242 2.04 355
31 Manhattan Harlem 40.81305 -73.95466 Private room 52 2 88 1.42 255
32 Brooklyn Greenpoint 40.72219 -73.93762 Private room 55 4 197 1.65 284
33 Manhattan Harlem 40.82130 -73.95318 Private room 50 3 273 2.37 359
... ... ... ... ... ... ... ... ... ... ...
48380 Queens Rockaway Beach 40.58790 -73.81269 Entire home/apt 99 1 1 1.00 162
48384 Brooklyn Fort Greene 40.68889 -73.97632 Private room 130 6 1 1.00 29
48387 Brooklyn Bushwick 40.70384 -73.92232 Private room 40 2 1 1.00 16
48391 Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 170 1 1 1.00 340
48392 Manhattan Harlem 40.80658 -73.95736 Private room 138 3 1 1.00 65
48394 Brooklyn Cypress Hills 40.68042 -73.88978 Private room 75 1 1 1.00 82
48401 Staten Island Rosebank 40.60750 -74.07979 Private room 65 1 1 1.00 179
48404 Brooklyn Williamsburg 40.71773 -73.94159 Private room 91 2 1 1.00 344
48409 Brooklyn Midwood 40.62555 -73.95724 Private room 50 1 1 1.00 86
48410 Brooklyn Williamsburg 40.70518 -73.93794 Private room 100 1 1 1.00 1
48413 Manhattan Upper East Side 40.76996 -73.95117 Private room 220 4 1 1.00 15
48453 Manhattan Kips Bay 40.73929 -73.98183 Private room 135 2 1 1.00 41
48454 Manhattan Upper East Side 40.77551 -73.95404 Private room 120 1 1 1.00 358
48457 Manhattan Financial District 40.70583 -74.01038 Entire home/apt 170 1 1 1.00 323
48524 Brooklyn Park Slope 40.67455 -73.98477 Entire home/apt 150 1 1 1.00 8
48526 Queens Bayswater 40.60804 -73.75829 Private room 45 3 1 1.00 90
48532 Queens Maspeth 40.71327 -73.90982 Private room 65 2 1 1.00 21
48534 Brooklyn Bedford-Stuyvesant 40.68969 -73.93285 Private room 68 1 1 1.00 127
48576 Manhattan Midtown 40.75286 -73.99297 Private room 120 2 1 1.00 7
48601 Manhattan Financial District 40.70603 -74.01084 Entire home/apt 75 1 1 1.00 181
48615 Queens Astoria 40.76887 -73.91128 Private room 150 1 1 1.00 165
48634 Manhattan Upper West Side 40.80281 -73.96550 Entire home/apt 110 3 2 2.00 15
48636 Brooklyn Bedford-Stuyvesant 40.68914 -73.92408 Private room 33 30 2 2.00 87
48701 Brooklyn Bedford-Stuyvesant 40.69551 -73.93951 Private room 45 1 2 2.00 14
48732 Manhattan Lower East Side 40.71825 -73.99019 Entire home/apt 150 4 1 1.00 13
48782 Manhattan Upper East Side 40.78099 -73.95366 Private room 129 1 1 1.00 147
48790 Queens Flushing 40.75104 -73.81459 Private room 45 1 1 1.00 339
48799 Staten Island Great Kills 40.54179 -74.14275 Private room 235 1 1 1.00 87
48805 Bronx Mott Haven 40.80787 -73.92400 Entire home/apt 100 1 2 2.00 40
48852 Brooklyn Bushwick 40.69805 -73.92801 Private room 30 1 1 1.00 1

35217 rows × 10 columns

Part 3 - Visualization

The different things we chose to visualize are as follows:

  1. Room Type vs Price
  2. Room Type vs Availability
  3. Map based on Rating vs Area
  4. Map based on Price vs Area
  5. Map based on Availability vs Area
In [5]:
#1 Room Type vs Price
data_by_roomtype = data.sort_values(["room_type"])
room_types = data_by_roomtype["room_type"].unique()


sums = {"entire": 0, "private": 0, "shared": 0}
tally = {"entire": 0, "private": 0, "shared": 0}
averages = {}

def generate_bar_plot(row, sortBy):
    global sums, tally
    if(row["room_type"]=="Entire home/apt"):
        sums["entire"]+=row[sortBy]
        tally["entire"]+=1
    elif(row["room_type"]=="Private room"):
        sums["private"]+=row[sortBy]
        tally["private"]+=1
    elif(row["room_type"] == "Shared room"):
        sums["shared"]+=row[sortBy]
        tally["shared"]+=1

# data_by_roomtype.apply(generate_bar_plot, axis=1)
for index, row in data_by_roomtype.iterrows():
    generate_bar_plot(row, "price") 

for k in sums:
    averages[k] = sums[k]/tally[k]
    
plt.bar(averages.keys(), averages.values())

plt.title("Rental Type vs Nightly Rate")
plt.ylabel("Nightly Rate")
plt.xlabel("Rental Type")
Out[5]:
Text(0.5, 0, 'Rental Type')
In [6]:
averages
Out[6]:
{'entire': 147.7998358348968,
 'private': 76.52815522800553,
 'shared': 56.74274905422446}

As expected, the rental price for booking an entire apartment or home is significantly more expensive than the other types of rooms.

In [96]:
#2 Room_type vs availability

for index, row in data_by_roomtype.iterrows():
    generate_bar_plot(row, "availability_365") 

for k in sums:
    averages[k] = sums[k]/tally[k]
    
plt.bar(averages.keys(), averages.values())
plt.title("Rental Type vs Availability")
plt.ylabel("Availability out of 365 days")
plt.xlabel("Rental Type")
Out[96]:
Text(0.5, 0, 'Rental Type')

Overall, it is apparent that booking an entire home or apartment is more available in NYC. This was a little surprising to us because our guess would have been that private room bookings would be more available (since we figured that travelling in smaller groups was more common).

In [97]:
averages
Out[97]:
{'entire': 125.42612570356472,
 'private': 95.98321625978812,
 'shared': 112.59268600252207}
In [7]:
#3 Map based on rating

# How to group by neighborhood group and then see the data?
m = folium.Map(location=[40.7128, -74.0060], zoom_start=11)
df = pandas.DataFrame(data)
df = df.sample(n = 1000)

for i in range(0, len(df)):

    if df.iloc[i]['number_of_reviews'] >= 0 and df.iloc[i]['number_of_reviews'] < 100:
        c = 'darkred'
    elif df.iloc[i]['number_of_reviews'] >= 100 and df.iloc[i]['number_of_reviews'] < 200:
        c = 'black'
    elif df.iloc[i]['number_of_reviews'] >= 200 and df.iloc[i]['number_of_reviews'] < 300:
        c = 'orange'
    elif df.iloc[i]['number_of_reviews'] >= 300 and df.iloc[i]['number_of_reviews'] < 400:
        c = 'white'
    elif df.iloc[i]['number_of_reviews'] >= 400 and df.iloc[i]['number_of_reviews'] < 500:
        c = 'green'
    elif df.iloc[i]['number_of_reviews'] >= 500 and df.iloc[i]['number_of_reviews'] < 600:
        c = 'blue'
    elif df.iloc[i]['number_of_reviews'] >= 600 and df.iloc[i]['number_of_reviews'] < 700:
        c = 'purple'
        
    folium.Circle(
        radius=5,
        location=[df.iloc[i]['latitude'], df.iloc[i]['longitude']],
        popup= df.iloc[i]['neighbourhood_group'],
        color=c,
        fill=False,
    ).add_to(m)
    
m 
Out[7]:

Overall, the number of reviews are between 0-100. There is not a huge apparent pattern between the location of the reviews and the number of reviews as we had thought there would be. However, from some random samples, it seems that there is a higher chance of more reviews being farther away from the actual city, such as the Queens area. To see if this is actually the case, we decided to plot data specific to Queens and Brooklyn (which we have shown below).

In [8]:
# Map of ratings for "Queens" neighbourhood 

m = folium.Map(location=[40.7128, -74.0060], zoom_start=11)
df = pandas.DataFrame(data)
df = df.loc[df['neighbourhood_group'] == "Queens"]
df = df.sample(n=1000)

for i in range(0, len(df)):

    if df.iloc[i]['number_of_reviews'] >= 0 and df.iloc[i]['number_of_reviews'] < 100:
        c = 'darkred'
    elif df.iloc[i]['number_of_reviews'] >= 100 and df.iloc[i]['number_of_reviews'] < 200:
        c = 'black'
    elif df.iloc[i]['number_of_reviews'] >= 200 and df.iloc[i]['number_of_reviews'] < 300:
        c = 'orange'
    elif df.iloc[i]['number_of_reviews'] >= 300 and df.iloc[i]['number_of_reviews'] < 400:
        c = 'white'
    elif df.iloc[i]['number_of_reviews'] >= 400 and df.iloc[i]['number_of_reviews'] < 500:
        c = 'green'
    elif df.iloc[i]['number_of_reviews'] >= 500 and df.iloc[i]['number_of_reviews'] < 600:
        c = 'blue'
    elif df.iloc[i]['number_of_reviews'] >= 600 and df.iloc[i]['number_of_reviews'] < 700:
        c = 'purple'
        
    folium.Circle(
        radius=5,
        location=[df.iloc[i]['latitude'], df.iloc[i]['longitude']],
        popup= df.iloc[i]['neighbourhood_group'],
        color=c,
        fill=False,
    ).add_to(m)
    
m 
Out[8]:

Still as before, the majority of reviews are between 0-100. However, on some of the random samples, it seemed like there is a little more green/blue (reviews between 400-600) than the overall map. However, with just a sample of 1000 it is hard to generalize.

In [9]:
# Map of ratings for "Brooklyn" neighbourhood 

m = folium.Map(location=[40.7128, -74.0060], zoom_start=11)
df = pandas.DataFrame(data)
df = df.loc[df['neighbourhood_group'] == "Brooklyn"]
df = df.sample(n=1000)

for i in range(0, len(df)):

    if df.iloc[i]['number_of_reviews'] >= 0 and df.iloc[i]['number_of_reviews'] < 100:
        c = 'darkred'
    elif df.iloc[i]['number_of_reviews'] >= 100 and df.iloc[i]['number_of_reviews'] < 200:
        c = 'black'
    elif df.iloc[i]['number_of_reviews'] >= 200 and df.iloc[i]['number_of_reviews'] < 300:
        c = 'orange'
    elif df.iloc[i]['number_of_reviews'] >= 300 and df.iloc[i]['number_of_reviews'] < 400:
        c = 'white'
    elif df.iloc[i]['number_of_reviews'] >= 400 and df.iloc[i]['number_of_reviews'] < 500:
        c = 'green'
    elif df.iloc[i]['number_of_reviews'] >= 500 and df.iloc[i]['number_of_reviews'] < 600:
        c = 'blue'
    elif df.iloc[i]['number_of_reviews'] >= 600 and df.iloc[i]['number_of_reviews'] < 700:
        c = 'purple'
        
    folium.Circle(
        radius=5,
        location=[df.iloc[i]['latitude'], df.iloc[i]['longitude']],
        popup= df.iloc[i]['neighbourhood_group'],
        color=c,
        fill=False,
    ).add_to(m)
    
m 
Out[9]:

The majority of reviews are between 0-100. It is interesting to note that while Queens had a slightly higher chance of showing the green or blue dots that showed 400-600 reviews, this area had a smaller proportion (indiciating that it might be less likely). Again, it is hard to make a definitive conclusion just look at the map.

In [101]:
#4 Map based on Price vs Area

data_price_sample = data.sample(n=1000)

m = folium.Map(location=[40.7128, -74.0060], zoom_start=14)
heat_data = []
for index, row in data_price_sample.iterrows():
    loc_price = []
    lat = row['latitude']
    long = row['longitude']
    price = row['price']
    loc_price.append(lat)
    loc_price.append(long)
    loc_price.append(price)
    heat_data.append(loc_price)
    
HeatMap(heat_data, max_val=10000).add_to(m)
m
Out[101]:

The heatmap shows that Airbnb room prices are higher in the neighborhoods of Manhattan and Brooklyn. The other three neighborhoods (Staten Island, Queens and Bronx) do not show as high of prices based on the heatmap.

In [102]:
#5 Map based on Available Airbnbs vs Area

data_price_sample = data.sample(n=1000)

m = folium.Map(location=[40.7128, -74.0060], zoom_start=14)
heat_data = []
for index, row in data_price_sample.iterrows():
    loc_availability = []
    lat = row['latitude']
    long = row['longitude']
    availability_365 = row['availability_365']
    loc_availability.append(lat)
    loc_availability.append(long)
    loc_availability.append(-availability_365)
    heat_data.append(loc_availability)
    
HeatMap(heat_data, max_val=0).add_to(m)
m
Out[102]:

The heatmap shows that Airbnb room availabilities are lower in the neighborhoods of Manhattan and Brooklyn. The other three neighborhoods (Staten Island, Queens and Bronx) are not as busy in terms of availability.

In [10]:
map_hooray = folium.Map(location=[40.7128, -74.0060],
                    zoom_start = 10) 

df_acc = data 
# Ensure you're handing it floats
df_acc['latitude'] = df_acc['latitude'].astype(float)
df_acc['longitude'] = df_acc['longitude'].astype(float)

# Filter the DF for rows, then columns, then remove NaNs
heat_df = df_acc[df_acc['reviews_per_month']>5] # Reducing data size so it runs faster# Reducing data size so it runs faster
heat_df = heat_df[['latitude', 'longitude', 'price']]

# Create weight column, using date
heat_df = heat_df.dropna(axis=0, subset=['latitude','longitude', 'price'])

# List comprehension to make out list of lists
heat_data = [[[row['latitude'],row['longitude']] for index, row in heat_df[heat_df["price"]>20*i][heat_df["price"]<20*(i+1)].iterrows()] for i in range(0, 50)]

# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data,auto_play=True,max_opacity=0.7)
hm.add_to(map_hooray)
# Display the map
map_hooray

# heat_df
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:17: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Out[10]:

The Heatmap/timeseries above plots the locations of airbnbs based on increasing price. The first frame plots Airbnbs with a nightly rate less than 20 dollars and then each subsequent frame goes up by 20 dollars. This helps us get an idea of where the cheaper and more expensive airbnbs might be.

Mean, Median, and Standard Deviation Analysis

We also chose to measure Mean, Median, and Standard Deviations to have a more simple yet informative statistical analysis to get more insight into some of the data including:

  1. Price
  2. Number of Reviews per Month
  3. Number of Reviews (overall)
  4. Availabilities
In [104]:
print("---------------Price-------------")
print("Mean: " + str(np.mean(data["price"])))
print("Median: " + str(np.median(data["price"])))
print("Standard Deviation: " + str(np.std(data["price"])))
---------------Price-------------
Mean: 110.60033506545135
Median: 99.0
Standard Deviation: 56.10137347569434

The mean price is around $112 per night which seems to be standard for a big city. As we saw in our analysis before, the prices of renting out an entire apartment or home is still more expensive than the other types of Airbnbs.

In [105]:
print("---------------Number of Reviews per Month-------------")
print("Mean: " + str(np.mean(data["reviews_per_month"])))
print("Median: " + str(np.median(data["reviews_per_month"])))
print("Standard Deviation: " + str(np.std(data["reviews_per_month"])))
---------------Number of Reviews per Month-------------
Mean: 1.3742527756481455
Median: 0.71
Standard Deviation: 1.6939495945203122
In [106]:
print("---------------Number of Reviews-------------")
print("Mean: " + str(np.mean(data["number_of_reviews"])))
print("Median: " + str(np.median(data["number_of_reviews"])))
print("Standard Deviation: " + str(np.std(data["number_of_reviews"])))
---------------Number of Reviews-------------
Mean: 29.962745265070847
Median: 10.0
Standard Deviation: 49.09259910719924

We expected there to be more reviews for each Airbnb (because we figured that travelling to NYC was common and that people would be leaving more informative reviews for each Airbnb). An interesting thing to test to find out more information about these reviews would be to determine if the reviews were mainly positive or negative, and to see if this had any relationship to the neighbourhood area or price of the Airbnb.

In [107]:
print("---------------Availability-------------")
print("Mean: " + str(np.mean(data["availability_365"])))
print("Median: " + str(np.median(data["availability_365"])))
print("Standard Deviation: " + str(np.std(data["availability_365"])))
---------------Availability-------------
Mean: 110.63318851690944
Median: 48.0
Standard Deviation: 128.35365520252995

Airbnb's are generally available for 1/3 of the entire year! As further analysis, it would be intersting to determine if there are certain times during the year where Airbnb's happen to be the most available as compared to other times. For example, are they more available during the holiday season?

More Statistical Analysis Below

In [108]:
neighbourhood_averages = {}

for row in data.iterrows():
    if row[1][5] in neighbourhood_averages:
        neighbourhood_averages[row[1][1]][0]+=1
        neighbourhood_averages[row[1][1]][1]+=row[1][9]
    else:
        neighbourhood_averages[row[1][1]] = [1, row[1][9]]
neighbourhood_averages
Out[108]:
{'Kensington': [1, 53],
 'Midtown': [1, 7],
 'Clinton Hill': [1, 31],
 'East Harlem': [1, 42],
 'Murray Hill': [1, 2],
 'Bedford-Stuyvesant': [1, 14],
 "Hell's Kitchen": [1, 98],
 'Upper West Side': [1, 15],
 'Chinatown': [1, 215],
 'South Slope': [1, 9],
 'West Village': [1, 8],
 'Williamsburg': [1, 1],
 'Fort Greene': [1, 29],
 'Chelsea': [1, 7],
 'Crown Heights': [1, 89],
 'Park Slope': [1, 8],
 'Windsor Terrace': [1, 0],
 'Inwood': [1, 11],
 'East Village': [1, 326],
 'Harlem': [1, 65],
 'Greenpoint': [1, 43],
 'Bushwick': [1, 1],
 'Lower East Side': [1, 13],
 'Prospect-Lefferts Gardens': [1, 25],
 'Long Island City': [1, 334],
 'Kips Bay': [1, 41],
 'SoHo': [1, 203],
 'Upper East Side': [1, 147],
 'Prospect Heights': [1, 89],
 'Washington Heights': [1, 68],
 'Woodside': [1, 365],
 'Flatbush': [1, 14],
 'Carroll Gardens': [1, 21],
 'Gowanus': [1, 364],
 'Flatlands': [1, 149],
 'Cobble Hill': [1, 43],
 'Flushing': [1, 339],
 'Sunnyside': [1, 188],
 'DUMBO': [1, 0],
 'St. George': [1, 201],
 'Highbridge': [1, 2],
 'Financial District': [1, 181],
 'Morningside Heights': [1, 3],
 'Jamaica': [1, 176],
 'Middle Village': [1, 161],
 'Ridgewood': [1, 133],
 'NoHo': [1, 36],
 'Ditmars Steinway': [1, 26],
 'Roosevelt Island': [1, 61],
 'Greenwich Village': [1, 169],
 'Little Italy': [1, 300],
 'East Flatbush': [1, 357],
 'Tompkinsville': [1, 84],
 'Astoria': [1, 165],
 'Eastchester': [1, 365],
 'Kingsbridge': [1, 84],
 'Boerum Hill': [1, 264],
 'Brooklyn Heights': [1, 0],
 'Two Bridges': [1, 161],
 'Queens Village': [1, 19],
 'Rockaway Beach': [1, 162],
 'Forest Hills': [1, 288],
 'Nolita': [1, 76],
 'Woodlawn': [1, 29],
 'University Heights': [1, 191],
 'Allerton': [1, 175],
 'East New York': [1, 361],
 'Theater District': [1, 39],
 'Concourse Village': [1, 364],
 'Sheepshead Bay': [1, 250],
 'Emerson Hill': [1, 38],
 'Fort Hamilton': [1, 322],
 'Bensonhurst': [1, 18],
 'Tribeca': [1, 66],
 'Shore Acres': [1, 0],
 'Sunset Park': [1, 311],
 'Concourse': [1, 20],
 'Gramercy': [1, 160],
 'Elmhurst': [1, 23],
 'Brighton Beach': [1, 353],
 'Jackson Heights': [1, 326],
 'Cypress Hills': [1, 82],
 'St. Albans': [1, 167],
 'Arrochar': [1, 81],
 'Rego Park': [1, 1],
 'Wakefield': [1, 62],
 'Clifton': [1, 312],
 'Bay Ridge': [1, 356],
 'Spuyten Duyvil': [1, 326],
 'Stapleton': [1, 179],
 'Briarwood': [1, 208],
 'Ozone Park': [1, 52],
 'Columbia St': [1, 13],
 'Vinegar Hill': [1, 303],
 'Mott Haven': [1, 40],
 'Longwood': [1, 14],
 'Canarsie': [1, 360],
 'Battery Park City': [1, 339],
 'East Elmhurst': [1, 358],
 'New Springville': [1, 4],
 'Morris Heights': [1, 65],
 'Arverne': [1, 362],
 'Gravesend': [1, 222],
 'Mariners Harbor': [1, 140],
 'Concord': [1, 68],
 'Borough Park': [1, 25],
 'Downtown Brooklyn': [1, 6],
 'Flatiron District': [1, 6],
 'Civic Center': [1, 327],
 'Port Morris': [1, 88],
 'Fieldston': [1, 192],
 'Kew Gardens': [1, 305],
 'Midwood': [1, 86],
 'Mount Eden': [1, 0],
 'City Island': [1, 18],
 'Glendale': [1, 90],
 'Red Hook': [1, 325],
 'Richmond Hill': [1, 365],
 'Maspeth': [1, 21],
 'Port Richmond': [1, 365],
 'Williamsbridge': [1, 47],
 'Soundview': [1, 365],
 'Woodhaven': [1, 359],
 'Co-op City': [1, 365],
 'Stuyvesant Town': [1, 321],
 'Parkchester': [1, 68],
 'North Riverdale': [1, 174],
 'Dyker Heights': [1, 89],
 'Bronxdale': [1, 194],
 'Riverdale': [1, 52],
 'Kew Gardens Hills': [1, 160],
 'Bay Terrace': [1, 169],
 'Norwood': [1, 271],
 'Claremont Village': [1, 88],
 'Fordham': [1, 175],
 'Bayswater': [1, 90],
 'Navy Yard': [1, 0],
 'Brownsville': [1, 35],
 'Eltingville': [1, 291],
 'Mount Hope': [1, 167],
 'Clason Point': [1, 86],
 'Lighthouse Hill': [1, 71],
 'Springfield Gardens': [1, 89],
 'Howard Beach': [1, 78],
 'Belle Harbor': [1, 52],
 'Jamaica Estates': [1, 43],
 'Van Nest': [1, 342],
 'Bellerose': [1, 342],
 'Bayside': [1, 345],
 'Morris Park': [1, 343],
 'West Brighton': [1, 80],
 'College Point': [1, 212],
 'Far Rockaway': [1, 107],
 'South Ozone Park': [1, 327],
 'Tremont': [1, 146],
 'Corona': [1, 337],
 'Great Kills': [1, 87],
 'Manhattan Beach': [1, 215],
 'Marble Hill': [1, 52],
 'Dongan Hills': [1, 310],
 'Fresh Meadows': [1, 178],
 'East Morrisania': [1, 89],
 'Hunts Point': [1, 59],
 'Pelham Bay': [1, 336],
 'Randall Manor': [1, 342],
 'West Farms': [1, 310],
 'Silver Lake': [1, 0],
 'Laurelton': [1, 70],
 'Grymes Hill': [1, 44],
 'Holliswood': [1, 135],
 'Pelham Gardens': [1, 354],
 'Rosedale': [1, 89],
 'Edgemere': [1, 363],
 'New Brighton': [1, 10],
 'Baychester': [1, 46],
 'Melrose': [1, 0],
 'Sea Gate': [1, 180],
 'Bergen Beach': [1, 164],
 'Cambria Heights': [1, 151],
 'Richmondtown': [1, 300],
 'Throgs Neck': [1, 365],
 'Howland Hook': [1, 363],
 'Schuylerville': [1, 343],
 'Coney Island': [1, 129],
 "Prince's Bay": [1, 66],
 'South Beach': [1, 176],
 'Bath Beach': [1, 90],
 'Midland Beach': [1, 231],
 'Jamaica Hills': [1, 139],
 'Castleton Corners': [1, 40],
 'Oakwood': [1, 364],
 'Castle Hill': [1, 42],
 'Douglaston': [1, 143],
 'Huguenot': [1, 259],
 'Whitestone': [1, 5],
 'Edenwald': [1, 34],
 'Belmont': [1, 359],
 'Grant City': [1, 188],
 'Westerleigh': [1, 36],
 'Tottenville': [1, 299],
 'Morrisania': [1, 90],
 'Bay Terrace, Staten Island': [1, 0],
 'Westchester Square': [1, 355],
 'Little Neck': [1, 88],
 'Rosebank': [1, 179],
 'Mill Basin': [1, 322],
 'Hollis': [1, 89],
 'Arden Heights': [1, 55],
 "Bull's Head": [1, 362],
 'Olinville': [1, 188],
 'Neponsit': [1, 44],
 'Graniteville': [1, 0],
 'Unionport': [1, 365],
 'Rossville': [1, 59],
 'Breezy Point': [1, 59],
 'Willowbrook': [1, 351],
 'New Dorp Beach': [1, 307],
 'Todt Hill': [1, 36]}
In [109]:
data
Out[109]:
neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews reviews_per_month availability_365
0 Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 0.21 365
1 Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 0.38 355
3 Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 4.64 194
4 Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 0.10 0
5 Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 0.59 129
... ... ... ... ... ... ... ... ... ... ...
48782 Manhattan Upper East Side 40.78099 -73.95366 Private room 129 1 1 1.00 147
48790 Queens Flushing 40.75104 -73.81459 Private room 45 1 1 1.00 339
48799 Staten Island Great Kills 40.54179 -74.14275 Private room 235 1 1 1.00 87
48805 Bronx Mott Haven 40.80787 -73.92400 Entire home/apt 100 1 2 2.00 40
48852 Brooklyn Bushwick 40.69805 -73.92801 Private room 30 1 1 1.00 1

35217 rows × 10 columns

In [110]:
manhattan_mean = data[data["neighbourhood_group"]=="Manhattan"].price.mean()
brooklyn_mean = data[data["neighbourhood_group"]=="Brooklyn"].price.mean()
queens_mean = data[data["neighbourhood_group"]=="Queens"].price.mean()
staten_mean = data[data["neighbourhood_group"]=="Staten Island"].price.mean()
bronx_mean = data[data["neighbourhood_group"]=="Bronx"].price.mean()
In [111]:
def replace_neighbourhood(n):
    if(n == "Manhattan"): 
        return manhattan_mean
    elif(n == "Brooklyn"): 
        return brooklyn_mean
    elif(n == "Queens"): 
        return queens_mean
    elif(n == "Staten Island"): 
        return staten_mean
    elif(n == "Bronx"): 
        return bronx_mean
    
def replace_type(t):
    if(t == "Entire home/apt"):
        return averages["entire"]
    elif(t == "Private room"):
        return averages["private"]
    elif(t == "Shared room"):
        return averages["shared"]
In [112]:
pre = data.drop(['minimum_nights','availability_365'], axis=1)
pre.neighbourhood_group = pre["neighbourhood_group"].apply(replace_neighbourhood)
pre.room_type = pre["room_type"].apply(replace_type)
In [113]:
neighbourhood_averages2 = {}
for k in neighbourhood_averages:
    neighbourhood_averages2[k] = neighbourhood_averages[k][1]/neighbourhood_averages[k][0]
In [114]:
def neighbourhood_change(n):
    global neighbourhood_averages2
    return neighbourhood_averages2[n]

pre.neighbourhood = pre["neighbourhood"].apply(neighbourhood_change)

pre
Out[114]:
neighbourhood_group neighbourhood latitude longitude room_type price number_of_reviews reviews_per_month
0 101.481853 53.0 40.64749 -73.97237 95.983216 149 9 0.21
1 131.420141 7.0 40.75362 -73.98377 125.426126 225 45 0.38
3 101.481853 31.0 40.68514 -73.95976 125.426126 89 270 4.64
4 131.420141 42.0 40.79851 -73.94399 125.426126 80 9 0.10
5 131.420141 2.0 40.74767 -73.97500 125.426126 200 74 0.59
... ... ... ... ... ... ... ... ...
48782 131.420141 147.0 40.78099 -73.95366 95.983216 129 1 1.00
48790 84.539873 339.0 40.75104 -73.81459 95.983216 45 1 1.00
48799 83.415282 87.0 40.54179 -74.14275 95.983216 235 1 1.00
48805 73.342012 40.0 40.80787 -73.92400 125.426126 100 2 2.00
48852 101.481853 1.0 40.69805 -73.92801 95.983216 30 1 1.00

35217 rows × 8 columns

Violin Plot

In [115]:
violin = sns.violinplot(data=data, x='neighbourhood_group', y='price')
violin.set_title('Density and distribution of prices for each neighberhood_group')
Out[115]:
Text(0.5, 1.0, 'Density and distribution of prices for each neighberhood_group')

The Violin plots above show the price distribution of Airbnbs in the 5 bouroughs of NYC. All of the bouroughs, except for Manhattan, are clearly unimodal with the modes falling in the 40-60 dollar range. Manhattan, however, has a much more even distribution suggesting that the prices are going to be much more expensive, on average, than in the other bouroughs. It is important that our model takes this into account!

Part 4 - Classification

In [116]:
data_x = pre[["neighbourhood", "neighbourhood_group", "latitude", "longitude"]]
y = pre.price

X_train, X_test, y_train, y_test = train_test_split(data_x, y, test_size=0.2)


reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

predicts = reg.predict(X_test)

print("""
        Mean Absolute Error: {}
        Root Mean Squared Error: {}
        R2 Score: {}
     """.format(mean_absolute_error(y_test,predicts),np.sqrt(mean_squared_error(y_test, predicts)),r2_score(y_test,predicts),))
        Mean Absolute Error: 42.6601773417156
        Root Mean Squared Error: 52.28324054988423
        R2 Score: 0.13169825206150432
     

For the first linear regression we used 4 features. These are all the features that are related to geographic location since location is often a huge factor when it comes to determining a price for a home or apartment. While the MAE and RMSE were very reasonable but the R^2 score was only 0.14 which is not desireable. This suggested we needed to add more features if we are to continue with linear regression

In [117]:
data_x = pre[["neighbourhood", "latitude", "longitude"]]
y = pre.price

X_train, X_test, y_train, y_test = train_test_split(data_x, y, test_size=0.2)


reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

predicts = reg.predict(X_test)

print("""
        Mean Absolute Error: {}
        Root Mean Squared Error: {}
        R2 Score: {}
     """.format(mean_absolute_error(y_test,predicts),np.sqrt(mean_squared_error(y_test, predicts)),r2_score(y_test,predicts),))

a = reg.coef_
i = reg.intercept_
        Mean Absolute Error: 44.00267826102187
        Root Mean Squared Error: 53.824499597096484
        R2 Score: 0.09739233274730563
     

We first tried to remove the neighbourhood features since neighbourhood and neighbourhood_group are related. In fact, the neighbourhood group is just a more precise version of the neighbourhood. However, the R^2 value only decreased so this was not a good move. From then on we decided to include both the neithbourhood and the neighbourhood group.

In [118]:
data_x = pre.drop(["price"], axis=1)
y = pre.price

X_train, X_test, y_train, y_test = train_test_split(data_x, y, test_size=0.2)


reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

predicts = reg.predict(X_test)

print("""
        Mean Absolute Error: {}
        Root Mean Squared Error: {}
        R2 Score: {}
     """.format(mean_absolute_error(y_test,predicts),np.sqrt(mean_squared_error(y_test, predicts)),r2_score(y_test,predicts),))
        Mean Absolute Error: 31.43313104269536
        Root Mean Squared Error: 40.93980434797532
        R2 Score: 0.46770082510128985
     

Next, we simply tried to include all the features as there is likely some correlations between the availability, reviews, etc... on the price of an airbnb. This would allow the linear regression to have as meany features as possible and adjust the weights based on them. We were afraid that including all the features may not help more than the geographic features alone but the R^2 value jumped all the way up to 0.48.

In [119]:
data_x = pre.drop(["price", "number_of_reviews", "reviews_per_month"], axis=1)
y = pre.price

X_train, X_test, y_train, y_test = train_test_split(data_x, y, test_size=0.2)

reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

predicts = reg.predict(X_test)

print("""
        Mean Absolute Error: {}
        Root Mean Squared Error: {}
        R2 Score: {}
     """.format(mean_absolute_error(y_test,predicts),np.sqrt(mean_squared_error(y_test, predicts)),r2_score(y_test,predicts),))

a = reg.coef_
i = reg.intercept_
        Mean Absolute Error: 32.01262714429472
        Root Mean Squared Error: 41.337224159693264
        R2 Score: 0.4672694973705521
     

Lastly, we felt that the review based features may not be accurate predictors considering that an airbnb may have a lot of reviews for bring really good or for being really bad. This would suggest that it would not be a great predictor so we tried a linear regression without these features and the statistics essentially remained the same. This suggests that number and frequency of reviews are not a great feature to use for this particular dataset.

Conclusion

Through our data and analysis we can infer that the price of an Airbnb room is higher in Manhattan than the other 4 neighborhoods (Brooklyn, Queens, Staten Island, and Bronx). This could be due to higher tourist activities in Manhattan.

Our data and analysis can be used to help people, such as travelers, find places to stay in New York City that meet their preferences in terms of neighborhood, availability, and price. It can also be used to predict the price of an Airbnb room that meets their preference.

For future research we plan on predicting the popularity of an Airbnb room based on its attributes such as room type, neighborhood and availability. This analysis could be beneficial for audiences that want to list their homes in the New York City area and want to know the success rate.

To improve on our data and analysis we can add more attributes that would provide better insights and linear regression predictions. For example, we can include crime rate in each neighborhood group and neighborhood to see if it has an effect on the price and popularity of an Airbnb room. We can also include the exact square footage of an Airbnb room to better predict it's price value.

We hope this analysis helped learn more about Airbnb's in NYC! :)

In [ ]: